I was given a dataset to work on as part of a coding challenge for SAP.iO recruitment! It’s a list of ~6,000 wines, both white and red. I had to split it into two separate files because R is not the best with memory allocation, and my computer simply could not handle the CSV file with all 6,000 wines plus their features. This could be alleviated in the future by moving the work to an AWS instance and using their cloud machines to run these models.
Preface - I did this mainly with the framework I learned in my Intro to ML class (Industrial Engineering 142). They taught us how to code in R and walk through the steps of pre-processing, running regressions, building the correlation plot, modeling itself, and then interpreting the confusion matrix.
I’m going to work with two models primarily for this:
- The traditional k-nearest neighbors (KNN)
- randomForest
The first thing I did was go into the CSV file itself and split the red and white wines into two different datasets manually, though this could also be done in code by splitting into two data frames.
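As a minimal sketch of doing that split in code - assuming the combined file had a column such as `type` marking each wine, which is an assumption about the original CSV, not something I verified - it would look like this on a toy data frame:

```r
# Toy stand-in for the combined dataset; the real file would be read
# with read.csv(), and the 'type' column name is assumed.
wines <- data.frame(
  type    = c("red", "white", "red", "white"),
  alcohol = c(9.4, 8.8, 9.8, 9.5)
)

# Split into one data frame per wine type
red_wines   <- wines[wines$type == "red", ]
white_wines <- wines[wines$type == "white", ]

nrow(red_wines)   # 2 red wines in the toy data
nrow(white_wines) # 2 white wines in the toy data
```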
# The below lines are to set up R so it uses all of my
# computer's cores in order to run the models much quicker.
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoParallel(cores = detectCores() - 1)
# Setting the seed makes the random sampling reproducible
set.seed(10)
# Loading all the required libraries for my analysis
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(kknn)
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(corrplot)
## corrplot 0.84 loaded
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Using the read.csv function to read the data
df <- read.csv("red.csv")
# We don't want any empty cells in the data, so we will
# change all of the NA values to 0.
df[is.na(df)] <- 0
str(df)
## 'data.frame': 1599 obs. of 14 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ astringency.rating : num 0.81 0.86 0.85 1.14 0.81 0.8 0.85 0.79 0.83 0.8 ...
## $ residual.sugar : num 1.9 2.6 2.3 0 0 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ vintage : num 2001 2003 2006 2003 2004 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Running str(df) displays the internal structure of the red wine dataset. It shows that there are 1599 samples and 14 different variables. Everything is of datatype num aside from our response variable quality, which is an int.
We’re now going to visualize the data using plots for each of the predictor variables.
for (i in c(1:12)) {
plot(df[, i], jitter(df[, "quality"]), xlab = names(df)[i],
ylab = "quality", cex = 0.5, cex.lab = 1)
abline(lm(df[, "quality"] ~ df[ ,i]), lty = 3, lwd = 3)
}
The line on each of these plots displays the linear regression of our response variable quality as a function of each of the predictor variables.
We can see that a few of the regression lines show a very weak association to our response variable. We’ll later split into training and test sets and then we can figure out if we want to keep those features or remove them. I created a correlation plot next to further look at the associations between all the variables.
cor_redwines <- cor(df)
# Had some trouble displaying the graph, so going to save as .png and
# then show in the R markdown file.
png(height = 1200, width = 1500, pointsize = 25, file = 'red_cor_plot.png')
corrplot(cor_redwines, method = 'number')
dev.off()  # close the png device so the file is actually written
Here’s our graph. You can see the weak relationships between quality and citric.acid, free.sulfur.dioxide, and sulphates in the plot. After processing the data, we can say that non-linear classification models will be more appropriate than regression, because of all the weak associations shown in the correlation plot.
We need to convert our response variable to factor, and then do the split into training and testing sets.
df$quality <- as.factor(df$quality)
tr <- createDataPartition(df$quality, p = 2/3, list = F)
train_red <- df[tr,]
test_red <- df[-tr,]
We are going to go about this using both k-nearest neighbors (KNN) and randomForest. We will use the caret package, which we loaded earlier, to tune each model through its train function, with 5-fold cross-validation repeated 5 times.
Caret simplifies the tuning of the model. The expand.grid argument which we’ll use below combines all of the hyperparameter values into all possible combos.
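To make that concrete, here is a minimal base-R sketch (toy values, smaller than the grid used below) showing how expand.grid enumerates every combination:

```r
# Every combination of two kmax values, two distances, and two kernels
grid <- expand.grid(kmax = c(3, 5), distance = c(1, 2),
                    kernel = c("rectangular", "gaussian"))
nrow(grid)  # 2 * 2 * 2 = 8 candidate hyperparameter settings
ncol(grid)  # 3 columns, one per hyperparameter
```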
KNN uses distance, so we need to make sure all the predictor variables are standardized. We will use the preProcess argument in the train function for this.
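As a minimal sketch of what the "center" and "scale" pre-processing steps do internally (toy values, not the wine data): each column has its mean subtracted and is divided by its standard deviation, so no single feature dominates the distance calculation.

```r
x <- c(9.4, 9.8, 10.5, 11.0)      # e.g. a handful of alcohol values
x_std <- (x - mean(x)) / sd(x)    # equivalent to scale(x)

round(mean(x_std), 10)  # 0: standardized values are centered
round(sd(x_std), 10)    # 1: and have unit variance
```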
For KNN, we’ll use 5 kmax, 2 distance, and 3 kernel values. For the distance, 1 is the Manhattan distance and 2 is the Euclidean distance.
train_ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
kknn_grid <- expand.grid(kmax = c(3, 5, 7, 9, 11), distance = c(1, 2),
kernel = c("rectangular", "gaussian", "cos"))
kknn_train <- train(quality ~ ., data = train_red, method = "kknn",
trControl = train_ctrl, tuneGrid = kknn_grid,
preProcess = c("center", "scale"))
plot(kknn_train)
kknn_train$bestTune
## kmax distance kernel
## 26 11 1 gaussian
After the repeated cross-validation, the best tune is kmax = 11 with the Manhattan distance (distance = 1) and the Gaussian kernel.
For random forest, the only parameter we can tune is mtry, the number of variables randomly sampled at each split. We’ll try values 1 through 13, passed through the tuneGrid argument.
rf_grid <- expand.grid(mtry = 1:13)
rf_train <- train(quality ~ ., data = train_red, method = "rf",
                  trControl = train_ctrl, tuneGrid = rf_grid,
                  preProcess = c("center", "scale"))
plot(rf_train)
rf_train$bestTune
## mtry
## 2 2
An mtry of 2 is the best value to use here.
kknn_predictor <- predict(kknn_train, test_red)
confusionMatrix(kknn_predictor, test_red$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 12 167 68 8 1
## 6 3 5 57 127 33 3
## 7 0 0 3 17 24 2
## 8 0 0 0 0 1 0
##
## Overall Statistics
##
## Accuracy : 0.5989
## 95% CI : (0.5558, 0.6408)
## No Information Rate : 0.4275
## P-Value [Acc > NIR] : 1.57e-15
##
## Kappa : 0.3442
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.00000 0.00000 0.7357 0.5991 0.36364 0.000000
## Specificity 1.00000 1.00000 0.7072 0.6834 0.95269 0.998095
## Pos Pred Value NaN NaN 0.6523 0.5570 0.52174 0.000000
## Neg Pred Value 0.99435 0.96798 0.7818 0.7195 0.91340 0.988679
## Prevalence 0.00565 0.03202 0.4275 0.3992 0.12429 0.011299
## Detection Rate 0.00000 0.00000 0.3145 0.2392 0.04520 0.000000
## Detection Prevalence 0.00000 0.00000 0.4821 0.4294 0.08663 0.001883
## Balanced Accuracy 0.50000 0.50000 0.7215 0.6412 0.65816 0.499048
rf_predict <- predict(rf_train, test_red)
confusionMatrix(rf_predict, test_red$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 1 11 185 56 4 0
## 6 2 6 41 143 33 3
## 7 0 0 1 13 29 3
## 8 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6723
## 95% CI : (0.6306, 0.7121)
## No Information Rate : 0.4275
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4636
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.00000 0.00000 0.8150 0.6745 0.43939 0.0000
## Specificity 1.00000 1.00000 0.7632 0.7335 0.96344 1.0000
## Pos Pred Value NaN NaN 0.7198 0.6272 0.63043 NaN
## Neg Pred Value 0.99435 0.96798 0.8467 0.7723 0.92371 0.9887
## Prevalence 0.00565 0.03202 0.4275 0.3992 0.12429 0.0113
## Detection Rate 0.00000 0.00000 0.3484 0.2693 0.05461 0.0000
## Detection Prevalence 0.00000 0.00000 0.4840 0.4294 0.08663 0.0000
## Balanced Accuracy 0.50000 0.50000 0.7891 0.7040 0.70142 0.5000
For the red wine dataset, the random forest model performed best, with an accuracy of about 67% and a moderate Kappa of 0.4636; KNN trailed at about 60% accuracy with a Kappa of 0.3442.
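For intuition on where these summary numbers come from, here is a base-R sketch on a made-up two-class confusion matrix (not the wine results): accuracy is the diagonal of the matrix over its total, and Cohen’s Kappa discounts the agreement you would expect by chance.

```r
# Hypothetical confusion matrix: rows = predictions, cols = reference
cm <- matrix(c(50, 10,
                5, 35), nrow = 2, byrow = TRUE)

observed <- sum(diag(cm)) / sum(cm)              # overall accuracy
row_p <- rowSums(cm) / sum(cm)                   # prediction marginals
col_p <- colSums(cm) / sum(cm)                   # reference marginals
expected <- sum(row_p * col_p)                   # chance agreement
kappa <- (observed - expected) / (1 - expected)  # Cohen's Kappa

observed  # 0.85
```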
==================================================================================================================
df1 <- read.csv("white.csv")
# changing NA's to 0's.
df1[is.na(df1)] <- 0
str(df1)
## 'data.frame': 4898 obs. of 14 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ astringency.rating : num 0.72 0.66 0.83 0.74 0.74 0.83 0.65 0.72 0.66 0.83 ...
## $ residual.sugar : num 0 0 6.9 0 8.5 6.9 7 0 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ vintage : num 2004 2004 2006 2004 2007 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Running str(df1) on the white wine dataset shows that there are 4898 samples and the same 14 variables.
We’re now going to visualize the data using plots for each of the predictor variables.
for (i in c(1:12)) {
plot(df1[, i], jitter(df1[, "quality"]), xlab = names(df1)[i],
ylab = "quality", cex = 0.5, cex.lab = 1)
abline(lm(df1[, "quality"] ~ df1[ ,i]), lty = 3, lwd = 3)
}
The line on each of these plots displays the linear regression of our response variable quality as a function of each of the predictor variables.
Again, there are a few regression lines which show a very weak association. Like before, we will first split into training and test sets and then we can figure out if we want to keep those features or remove them.
cor_white <- cor(df1)
png(height = 1200, width = 1500, pointsize = 25, file = 'white_cor_plot.png')
corrplot(cor_white, method = 'number')
dev.off()  # close the png device so the file is actually written
Here’s our graph. You can see the weak relationships between quality and citric.acid, residual.sugar, free.sulfur.dioxide, and sulphates in the plot. After processing the data, we can again say that non-linear classification models will be more appropriate than regression.
We need to convert our response variable to factor, and then do the split into training and testing sets.
df1$quality <- as.factor(df1$quality)
tr_white <- createDataPartition(df1$quality, p = 2/3, list = F)
train_white <- df1[tr_white,]
test_white <- df1[-tr_white,]
We are going to go about this using both k-nearest neighbors (KNN) and randomForest. As before, we will use the caret package to tune each model through its train function, with 5-fold cross-validation repeated 5 times.
Caret simplifies the tuning of the model. The expand.grid argument which we’ll use below combines all of the hyperparameter values into all possible combos.
KNN uses distance, so we need to make sure all the predictor variables are standardized. We will use the preProcess argument in the train function for this.
For KNN, we’ll use 5 kmax, 2 distance, and 3 kernel values. For the distance, 1 is the Manhattan distance and 2 is the Euclidean distance.
train_ctrl_white <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
kknn_grid_white <- expand.grid(kmax = c(3, 5, 7, 9, 11), distance = c(1, 2),
kernel = c("rectangular", "gaussian", "cos"))
kknn_train_white <- train(quality ~ ., data = train_white, method = "kknn",
trControl = train_ctrl_white, tuneGrid = kknn_grid_white,
preProcess = c("center", "scale"))
plot(kknn_train_white)
kknn_train_white$bestTune
## kmax distance kernel
## 15 7 1 cos
After the 5 repetitions, the best tune is kmax = 7 with the Manhattan distance and the cosine kernel.
For this model, only the mtry hyperparameter is of use to us. We’ll pass mtry values of 1 through 13 into the train function’s tuneGrid argument.
rf_grid_white <- expand.grid(mtry = 1:13)
rf_train_white <- train(quality ~ ., data = train_white, method = "rf",
                        trControl = train_ctrl_white, tuneGrid = rf_grid_white,
                        preProcess = c("center", "scale"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
plot(rf_train_white)
rf_train_white$bestTune
## mtry
## 2 2
An mtry of 2 is the best value to use here.
We’ll plot the confusion matrix for both of the models to see which model we can use to get some sort of conclusive result from this dataset.
kknn_predict_white <- predict(kknn_train_white, test_white)
confusionMatrix(kknn_predict_white, test_white$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 2 9 6 7 0 0 0
## 5 3 28 286 163 13 6 1
## 6 1 14 173 463 112 21 0
## 7 0 3 19 87 158 15 0
## 8 0 0 1 12 10 16 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5721
## 95% CI : (0.5477, 0.5963)
## No Information Rate : 0.4494
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3516
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.166667 0.5897 0.6325 0.53925 0.275862
## Specificity 1.000000 0.990476 0.8129 0.6421 0.90719 0.985360
## Pos Pred Value NaN 0.375000 0.5720 0.5906 0.56028 0.410256
## Neg Pred Value 0.996317 0.971963 0.8237 0.6817 0.89978 0.973585
## Prevalence 0.003683 0.033149 0.2977 0.4494 0.17986 0.035605
## Detection Rate 0.000000 0.005525 0.1756 0.2842 0.09699 0.009822
## Detection Prevalence 0.000000 0.014733 0.3069 0.4813 0.17311 0.023941
## Balanced Accuracy 0.500000 0.578571 0.7013 0.6373 0.72322 0.630611
## Class: 9
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9993861
## Prevalence 0.0006139
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
rf_predict_white <- predict(rf_train_white, test_white)
confusionMatrix(rf_predict_white, test_white$quality)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 0 6 1 1 0 0 0
## 5 3 33 312 120 5 0 0
## 6 3 15 169 574 139 26 1
## 7 0 0 3 37 149 21 0
## 8 0 0 0 0 0 11 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6458
## 95% CI : (0.622, 0.669)
## No Information Rate : 0.4494
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4415
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.111111 0.6433 0.7842 0.50853 0.189655
## Specificity 1.000000 0.998730 0.8593 0.6065 0.95434 1.000000
## Pos Pred Value NaN 0.750000 0.6596 0.6192 0.70952 1.000000
## Neg Pred Value 0.996317 0.970389 0.8503 0.7749 0.89852 0.970952
## Prevalence 0.003683 0.033149 0.2977 0.4494 0.17986 0.035605
## Detection Rate 0.000000 0.003683 0.1915 0.3524 0.09147 0.006753
## Detection Prevalence 0.000000 0.004911 0.2904 0.5691 0.12891 0.006753
## Balanced Accuracy 0.500000 0.554921 0.7513 0.6953 0.73144 0.594828
## Class: 9
## Sensitivity 0.0000000
## Specificity 1.0000000
## Pos Pred Value NaN
## Neg Pred Value 0.9993861
## Prevalence 0.0006139
## Detection Rate 0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy 0.5000000
For white wine, the random forest model again performed better, with an accuracy of about 65% (95% CI: 0.622 to 0.669) and a Kappa of 0.4415; KNN did not perform as well. Both did a rather poor job of identifying white wines in the two lowest and two highest classes.
==================================================================================================================
# Finishing up
From our models here, we’ve learned that they are only accurate at identifying average-quality wines, which limits their usefulness. It is quite difficult to conclude that a model of this kind can accurately identify the lowest- and highest-quality wines.
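One likely reason is class imbalance: the extreme grades are so rare that the models see too few examples to learn them. A base-R sketch with counts shaped like the commonly cited red-wine quality distribution (illustrative; I have not verified them against this exact file) makes the skew visible:

```r
# Illustrative per-grade counts for the red wine dataset (assumed values)
quality_counts <- c(`3` = 10, `4` = 53, `5` = 681,
                    `6` = 638, `7` = 199, `8` = 18)

round(quality_counts / sum(quality_counts), 3)
# grades 5 and 6 alone account for over 80% of all samples,
# leaving almost nothing for the models to learn grades 3, 4, and 8 from
```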